Web Crawls

The Internet Archive's Web Archive began in late 1996. It is made available through the Wayback Machine, and some collections are available in bulk to researchers. The Internet Archive archives many pages on behalf of other contributors, including Archive-It partners and Save Page Now users. Other captures are donated to the Internet Archive by partners such as Alexa Internet.

2,258,067
RESULTS


TOPIC
crawldata 887,680
wiki 555,898
dumps 529,568
incremental 504,371
Wikipedia 201,560
Wiktionary 105,532
Wikibooks 55,562
Wikiquote 48,028
Wikisource 45,157
Wikimedia 29,226
wikiteam 26,416
MediaWiki 26,295
data dumps 25,176
no404 24,482
Wikinews 20,930
English 20,128
unknowncopyright 16,006
archiveteam 12,988
wikipedia 12,644
Wikivoyage 11,932
wordpress 11,437
Wikiversity 10,771
Italian 5,756
French 5,755
Greek 5,754
German 5,753
Spanish 5,753
Swedish 5,718
Portuguese 5,716
Russian 5,716
Portuguese Web Archive 5,097
Portuguese online publications 5,097
Czech 5,046
Arabic 5,045
Japanese 5,044
Finnish 5,042
Korean 5,039
Hebrew 5,036
Ukrainian 5,011
Polish 5,003
Romanian 4,999
Chinese 4,977
Persian 4,922
Bosnian 4,326
Catalan 4,326
Bulgarian 4,325
Dutch 4,325
Esperanto 4,323
Norwegian 4,300
Serbian 4,296
Tamil 4,295
Turkish 4,295
Vietnamese 4,294
Slovenian 4,292
www.dailymail.co.uk 4,225
WARC 3,939
archive 3,907
snapshot 3,894
Arcmaj3 3,862
Arcmaj3BarrelData 3,740
Hungarian 3,629
Thai 3,611
Welsh 3,607
Azerbaijani 3,606
Croatian 3,606
Lithuanian 3,606
Belarusian 3,605
Estonian 3,605
Limburgish 3,605
Armenian 3,604
Galician 3,604
Marathi 3,604
Latin 3,603
Malayalam 3,603
Danish 3,602
Indonesian 3,602
Icelandic 3,598
Telugu 3,583
Albanian 3,580
Slovak 3,578
Sanskrit 3,577
media 3,496
tape 3,494
Complete crawl of the Portuguese web 3,425
Gujarati 2,911
Kannada 2,911
metro.co.uk 2,897
Breton 2,886
Georgian 2,886
Afrikaans 2,885
Basque 2,885
Bengali 2,885
Hindi 2,885
Kyrgyz 2,884
Macedonian 2,883
Kurdish 2,882
website 2,865
Urdu 2,864
forum 2,711
web archive 2,704
discussion forum 2,699
Chinese (Min Nan) 2,217
Kazakh 2,191
Uzbek 2,185
Sundanese 2,183
Tatar 2,179
Faroese 2,166
Malagasy 2,166
Western Frisian 2,166
Interlingua 2,165
Khmer 2,165
Malay 2,164
Nepali 2,160
Norwegian Nynorsk 2,156
Occitan 2,156
Wolof 2,152
Yiddish 2,152
Tajik 2,151
Tagalog 2,150
Punjabi 2,149
Sinhala 2,147
Venetian 2,145
Oriya 2,019
Old English 1,742
Interlingue 1,688
Incremental crawl of the Portuguese web 1,672
gardening 1,578
horticulture 1,577
plants 1,577
gardeners 1,576
Asturian 1,501
Corsican 1,501
Irish 1,500
Luxembourgish 1,499
Nauru 1,499
Kashmiri 1,497
Low German 1,496
Assamese 1,494
Quechua 1,490
Simple English 1,489
Uyghur 1,489
Turkmen 1,488
Amharic 1,479
Aymara 1,473
Guarani 1,473
Latvian 1,473
Lingala 1,473
Mongolian 1,473
Maori 1,472
Burmese 1,471
Cornish 1,471
Walloon 1,465
Zulu 1,465
Pashto 1,463
Swahili 1,462
Sindhi 1,460
Aragonese 1,447
Chuvash 1,447
Kashubian 1,447
Hausa 1,446
Lao 1,446
Manx 1,446
Scottish Gaelic 1,446
Upper Sorbian 1,446
Cherokee 1,445
Fijian 1,445
Inuktitut 1,445
Javanese 1,445
Lojban 1,445
Divehi 1,444
Ido 1,444
Kalaallisut 1,444
Maltese 1,444
Oromo 1,440
Aromanian 1,436
Kinyarwanda 1,436
Samoan 1,436
Somali 1,435
Southern Sotho 1,435
Sango 1,434
Serbo-Croatian 1,434
Sicilian 1,434
Swati 1,434
Tigrinya 1,434
Tok Pisin 1,434
Tswana 1,434
Tsonga 1,433
Sakha 1,432
Western Punjabi 1,429
html 1,397
htmldumps 1,389
Nāhuatl 1,366
www.telegraph.co.uk 1,317
videobot 1,213
New York City 1,129
theater 1,129
Broadway 1,128
London 1,128
theatre 1,128
musicals 1,127
LANGUAGE
English 67,612
Portuguese 5,306
German 1,700
Spanish 958
Russian 888
French 705
Dutch 357
Korean 279
Chinese 246
Italian 242
Internet Archive Web Crawls
collection
722,183
ITEMS
4.7B
VIEWS
The Internet Archive discovers and captures web pages through many different web crawls. At any given time several distinct crawls are running, some for months, and some every day or longer. View the web archive through the Wayback Machine.
Topic: webwidecrawl
Alexa Crawls
collection
129,649
ITEMS
2.2B
VIEWS
Since 1996, Alexa Internet has been donating its crawl data to the Internet Archive. These data flow in every day and are added to the Wayback Machine after an embargo period.
Topics: web crawl, Alexa
Worldwide Web Crawls
collection
373,183
ITEMS
1.9B
VIEWS
Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites. Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for...
Live Web Proxy Crawls
collection
12,283
ITEMS
1.1B
VIEWS
Content crawled via the Wayback Machine Live Proxy, mostly by the Save Page Now feature on web.archive.org. The liveweb proxy is a component of the Internet Archive’s Wayback Machine project. It captures the content of a web page in real time, archives it into an ARC or WARC file, and returns the ARC/WARC record to the Wayback Machine for processing. The recorded ARC/WARC file becomes part of the Wayback Machine in due course.
Survey Crawls
collection
63,876
ITEMS
977.9M
VIEWS
Survey crawls are run about twice a year, on average, and attempt to capture the content of the front page of every web host ever seen by the Internet Archive since 1996.
Topic: survey crawls
Archive-It Digital Collection
collection
158,682
ITEMS
401.7M
VIEWS
Archive-It is a subscription web archiving service of the Internet Archive that helps organizations harvest, build, and preserve collections of digital content. Partners create domain-specific collections of web captures that can be searched on Archive-It. Content is hosted and stored at the Internet Archive data centers. Archive-It works with more than 400 partner organizations in 48 U.S. states and 16 countries worldwide including: College and University Libraries State Archives, Libraries,...
Topic: Colleges, Universities, Libraries, Archives, NGOs, Museums
Survey Crawl April 2013
collection
16,282
ITEMS
381.3M
VIEWS
Survey crawl of domains started April 2013. This data is currently not publicly accessible.
Focused Crawls
collection
63,197
ITEMS
380M
VIEWS
by Internet Archive
Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain.
Topic: webcrawl
web-group-internal
collection
29,695
ITEMS
347.8M
VIEWS
miscellaneous data
Topic: brad tofel
Custom Crawl Services
collection
44,544
ITEMS
330.8M
VIEWS
by Internet Archive
National library harvesting.
Topic: ccs
Wide Crawl started April 2013
collection
25,005
ITEMS
305.6M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from April 2013.
Wayback Indexes
collection
554
ITEMS
292.6M
VIEWS
Wayback indexes. This data is currently not publicly accessible.
alexa_2007
collection
7,636
ITEMS
251.8M
VIEWS
This data is currently not publicly accessible.
Top Domains
collection
59,702
ITEMS
249.5M
VIEWS
A daily collection of thousands of the most popular web sites according to Alexa.com's top sites rankings.
Topics: daily, popular sites, Alexa
Fix Broken Links Web Crawls
collection
36,565
ITEMS
221.9M
VIEWS
These crawls are part of an effort to archive pages as they are created, along with the pages that they refer to. That way, as the referenced pages are changed or taken off the web, a link to the version that was live when the page was written will be preserved. The Internet Archive hopes that references to these archived pages will then be put in place of links that would otherwise be broken, or added as companion links to allow people to see what was originally intended by a page's...
Survey Crawl December 2014
collection
11,190
ITEMS
194.7M
VIEWS
Survey crawl of domains started December 2014. This data is currently not publicly accessible.
Wide Crawl started August 2013
collection
21,909
ITEMS
188.2M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from August 2013.
alexa_2006
collection
6,507
ITEMS
183.7M
VIEWS
This data is currently not publicly accessible.
Wide Crawl started January 2012
collection
30,362
ITEMS
183.6M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.
Wide Crawl started June 2014
collection
45,313
ITEMS
182M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from June 2014.
Wide Crawl started April 2012
collection
32,586
ITEMS
170.2M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from April 2012.
Wiki Collections
collection
727,346
ITEMS
146.5M
VIEWS
Collections of Wiki data
Topics: crawls, data, wiki
Wikipedia Outlinks
collection
11,511
ITEMS
144.1M
VIEWS
Crawl of outlinks from wikipedia.org. These files are currently not publicly accessible. From Wikipedia: Wikipedia is a multilingual, web-based, free-content encyclopedia project operated by the Wikimedia Foundation and based on an openly editable model. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links to guide the user...
Web Wide Crawl started March, 14th 2015
collection
49,621
ITEMS
138.7M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from January 2015.
Archive Team
collection
64,496
ITEMS
124.4M
VIEWS
Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is 100% composed of volunteers and interested parties, and has expanded into a large number of related projects for saving online and digital history. History is littered with hundreds of conflicts over the future of a community, group, location or...
Wide Crawl started October 2010
collection
15,839
ITEMS
124.2M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from October 2010.
Wide Crawl started September 2012
collection
22,402
ITEMS
120.8M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from September 2012.
Wide Crawl Started January 2013
collection
15,138
ITEMS
120.8M
VIEWS
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Wikipedia Outbound Links
collection
11,651
ITEMS
120.3M
VIEWS
This is a collection of web page captures from links added to, or changed on, Wikipedia pages. The idea is to bring reliability to Wikipedia outlinks so that if the pages referenced by Wikipedia articles are changed, or go away, a reader can still find what was originally referred to. This is part of the Internet Archive's attempt to rid the web of broken links.
Topics: Wikipedia, Wikimedia
Around The World Crawl
collection
2,150
ITEMS
119.4M
VIEWS
Data crawled by Sloan Foundation on behalf of Internet Archive.
Wide Crawl started October 2011
collection
10,122
ITEMS
108.1M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.
.com survey started January 2011
collection
2,535
ITEMS
104.3M
VIEWS
Survey crawl of .com domains started January 2011.
Topic: webcrawl
Survey Crawl May 2014
collection
6,909
ITEMS
104.2M
VIEWS
Survey crawl of domains started May 2014. This data is currently not publicly accessible.
Top News
collection
44,089
ITEMS
101.5M
VIEWS
A daily collection of hundreds of the world's top news sites.
Topics: daily, news
Survey Crawl started July 2015
collection
10,137
ITEMS
97.6M
VIEWS
Survey crawl of domains. This data is currently not publicly accessible.
Wide Crawl started March 2011
collection
8,528
ITEMS
96.6M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT)...
38_crawl
collection
1,387
ITEMS
86.7M
VIEWS
This data is currently not publicly accessible.
Wide Crawl started February 2014
collection
9,789
ITEMS
82.8M
VIEWS
Web wide crawl with initial seedlist and crawler configuration from February 2014.
alexa_web_2009
collection
3,080
ITEMS
78.6M
VIEWS
This data is currently not publicly accessible.
ArchiveBot: The Archive Team Crowdsourced Crawler
collection
1,614
ITEMS
78.5M
VIEWS
ArchiveBot is an IRC bot designed to automate the archival of smaller websites (e.g. up to a few hundred thousand URLs). You give it a URL to start at, and it grabs all content under that URL, records it in a WARC, and then uploads that WARC to ArchiveTeam servers for eventual injection into the Internet Archive (or other archive sites). To use ArchiveBot, drop by #archivebot on EFNet. To interact with ArchiveBot, you issue commands by typing them into the channel. Note that you will need channel...
Topics: archiveteam, archivebot, webcrawl, robot, love
alexa_web_2010
collection
2,994
ITEMS
76.6M
VIEWS
This data is currently not publicly accessible.
Alexa Crawl EG
collection
1,678
ITEMS
75.9M
VIEWS
Crawl EG from Alexa Internet. This data is currently not publicly accessible.
National Library of Australia Crawls
collection
11,047
ITEMS
75.3M
VIEWS
Crawls performed by Internet Archive on behalf of the National Library of Australia. This data is currently not publicly accessible.
Wikipedia Outlinks February 2012
collection
2,951
ITEMS
74.8M
VIEWS
Crawl of outlinks from wikipedia.org started February, 2012. These files are currently not publicly accessible.
web_iq
collection
2,650
ITEMS
71.6M
VIEWS
Crawl performed by Internet Archive. This data is currently not publicly accessible.
web_wk
collection
9,978
ITEMS
69.3M
VIEWS
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Wordpress Blogs and the Pages They Link To
collection
11,434
ITEMS
69.1M
VIEWS
This is a collection of pages and embedded objects from WordPress blogs and the external pages they link to. Captures of these pages are made on a continuous basis, seeded from a feed of new or changed pages hosted by WordPress.com or by sites running a properly configured Jetpack WordPress plugin.
Topics: Wordpress.com, blogs, jetpack
Wide Crawl Number 13
collection
45,739
ITEMS
64.8M
VIEWS
Web Wide Crawl Number 13
National Library of Spain
collection
6,722
ITEMS
62M
VIEWS
Data collected by Internet Archive on behalf of the National Library of Spain. This data is currently not publicly accessible.
26_crawl
collection
1,466
ITEMS
55.8M
VIEWS
This data is currently not publicly accessible.
Survey Crawl
collection
9,928
ITEMS
53M
VIEWS
Survey crawl of domains. This data is currently not publicly accessible.
51_crawl
collection
1,138
ITEMS
52.4M
VIEWS
This data is currently not publicly accessible.
Bibliotheque Nationale de France Domain Crawls
collection
1,652
ITEMS
49.3M
VIEWS
Crawls of the French domain space performed by Internet Archive on behalf of the Bibliotheque Nationale de France. This data is currently not publicly accessible.
52_crawl
collection
2,589
ITEMS
48.2M
VIEWS
This data is currently not publicly accessible.
35_crawl
collection
1,179
ITEMS
43.3M
VIEWS
This data is currently not publicly accessible.
Alexa Crawls DF
collection
248
ITEMS
42.5M
VIEWS
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Shallow Crawls
collection
1,042
ITEMS
41.6M
VIEWS
Shallow crawls that collect content 1 level deep including embeds. This data is currently not publicly accessible.
alexa_1999
collection
243
ITEMS
38.8M
VIEWS
This data is currently not publicly accessible.
Alexa Crawl EI
collection
1,408
ITEMS
38.4M
VIEWS
Crawl EI from Alexa Internet. This data is currently not publicly accessible.
International News Crawls
collection
3,388
ITEMS
38.2M
VIEWS
Crawls of International News Sites
web_el_2008
collection
1,705
ITEMS
37.9M
VIEWS
This data is currently not publicly accessible.
Alexa Crawl DX
collection
1,442
ITEMS
37.6M
VIEWS
Crawl DX from Alexa Internet. This data is currently not publicly accessible.
Alexa Crawls DO
collection
493
ITEMS
37M
VIEWS
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
web_mon
collection
579
ITEMS
36.1M
VIEWS
Crawl performed by Internet Archive. This data is currently not publicly accessible.
29_crawl
collection
1,568
ITEMS
36M
VIEWS
This data is currently not publicly accessible.
Wikipedia Outlinks May 2011
collection
1,638
ITEMS
35.4M
VIEWS
Crawl of outlinks from wikipedia.org started May, 2011. These files are currently not publicly accessible.
Archive-It Partners
collection
37,886
ITEMS
34.7M
VIEWS
Archive-It is the leading web archiving service for collecting and accessing cultural heritage on the web and is a service of Internet Archive used by libraries, archives, governments, non-profits, and other organizations to build collections of web materials.
Topic: TK
Alexa Crawls DY
collection
1,326
ITEMS
34.7M
VIEWS
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Alexa Crawls EA
collection
1,315
ITEMS
34.5M
VIEWS
Crawl data donated by Alexa Internet. This data is currently not publicly accessible
Topic: crawldata
20th Century Web
collection
331
ITEMS
34.3M
VIEWS
Collection of web items from the 20th century.
Topics: web, 20th century
web_tran
collection
4,193
ITEMS
33.7M
VIEWS
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Internet Archive Global Events
collection
6,701
ITEMS
32.4M
VIEWS
Internet Archive Global Events
Archive-It Partner Since: Feb, 2006
Organization Type: Other Institutions
Organization URL: http://www.archive-it.org
alexa_ed
collection
1,185
ITEMS
32.2M
VIEWS
This data is currently not publicly accessible.
Alexa Crawl DZ
collection
1,207
ITEMS
32M
VIEWS
Crawl DZ from Alexa Internet. This data is currently not publicly accessible.
Alexa Crawl EH
collection
1,218
ITEMS
31.7M
VIEWS
Crawl EH from Alexa Internet. This data is currently not publicly accessible.